METHOD AND SYSTEM FOR SELECTIVELY ACCESSING FILES
ACCESSIBLE THROUGH A NETWORK 



BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention generally relates to a method and system for 
periodically searching through files accessible through a network, and in 
particular, to a method and system for searching through files accessible on a 
network during scheduled periodic searches of files based on data from files
previously accessed. 

Description of the Related Art 
A network server maintains various files accessible across a network. In 
the case of the Internet, the files may comprise hypertext mark-up language
(HTML) data, Common Gateway Interface (CGI) scripts, image files (e.g., .jpg and
.gif), and Channel Definition Format (CDF) files. Collectively, the files linked 
through HTML files produce a website, wherein the server acts as the website 
host. 

CDFs are small files which include data used by a website's "push"
technology to specify how often, and what parts of, the site will be "pushed"
(e.g., e-mailed) directly to a registered subscriber. Based on the data in the
CDF, the website will e-mail various information to the subscriber.

A typical CDF file is an Extensible Markup Language (XML) file. A CDF
file contains various elements referred to as tags. Examples of tags include
CHANNEL, ITEM, USERSCHEDULE, SCHEDULE, LASTMOD, and LEVEL.

The CHANNEL tag has an HREF attribute that specifies the Uniform
Resource Locator (URL) on the website that corresponds to that CHANNEL. For
example: 

<CHANNEL HREF= "http://www.mysite.com/Channel/homepage.htm"> 

The SCHEDULE tag indicates when a channel should be updated. For 
example: 

<SCHEDULE STARTDATE="1999-09-23" STOPDATE="1999-11-23">
<INTERVALTIME DAY="1" />
<EARLIESTTIME HOUR="2" />
<LATESTTIME HOUR="6" />
</SCHEDULE>

indicates that the channel should be updated every day, from the start
date to the stop date, between the hours of 2:00 and 6:00.
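
As a rough illustration of how a program might read such a schedule, the following sketch parses a well-formed SCHEDULE element with Python's standard `xml.etree.ElementTree` module. The function name `parse_schedule` and the keys of the returned dictionary are hypothetical and chosen only for this example; the element and attribute names are assumed to follow the example above.

```python
import xml.etree.ElementTree as ET
from datetime import date


def parse_schedule(xml_text):
    """Extract update-schedule values from a SCHEDULE element (illustrative)."""
    sched = ET.fromstring(xml_text.strip())

    def int_attr(tag, attr, default=0):
        # A missing child element or attribute falls back to the default.
        node = sched.find(tag)
        return int(node.get(attr, default)) if node is not None else default

    return {
        "start": date.fromisoformat(sched.get("STARTDATE")),
        "stop": date.fromisoformat(sched.get("STOPDATE")),
        "interval_days": int_attr("INTERVALTIME", "DAY"),
        "earliest_hour": int_attr("EARLIESTTIME", "HOUR"),
        "latest_hour": int_attr("LATESTTIME", "HOUR"),
    }


# Sample input written for this sketch, modeled on the SCHEDULE example.
EXAMPLE = """
<SCHEDULE STARTDATE="1999-09-23" STOPDATE="1999-11-23">
  <INTERVALTIME DAY="1" />
  <EARLIESTTIME HOUR="2" />
  <LATESTTIME HOUR="6" />
</SCHEDULE>
"""

schedule = parse_schedule(EXAMPLE)
print(schedule["interval_days"], schedule["earliest_hour"], schedule["latest_hour"])
```

The returned values (an interval in days and an hour window) are what a crawler would need in order to decide when the channel is next expected to change.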

Occasionally, a channel may have a subchannel. A subchannel refers to 
sub-sites on the website. A subchannel may appear as: 

<ITEM HREF= "foobar.htm" LASTMOD= "1999-01-01TO0101" 
LEVEL= "2"> 

<USAGE VALUE= "ScreenSaver"></USAGE> 

</ITEM> 

A subchannel references a URL together with information about when the
page at that URL was last modified, from which it can be determined whether
the information at that URL is relevant.
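
The subchannel data can be read in much the same way as any other XML. The sketch below, which is only one possible approach, collects the HREF, LASTMOD, LEVEL, and USAGE values of each ITEM inside a CHANNEL. The function name `parse_items`, the record layout, and the LASTMOD timestamp in the sample input are assumptions made for illustration; in particular, LASTMOD is assumed to be an ISO 8601 timestamp.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def parse_items(channel_xml):
    """Return one record per ITEM (subchannel) found in a CHANNEL element."""
    channel = ET.fromstring(channel_xml.strip())
    records = []
    for item in channel.iter("ITEM"):
        lastmod = item.get("LASTMOD")
        usage = item.find("USAGE")
        records.append({
            "href": item.get("HREF"),
            # LASTMOD is assumed here to be an ISO 8601 timestamp.
            "lastmod": datetime.fromisoformat(lastmod) if lastmod else None,
            "level": int(item.get("LEVEL", 0)),
            "usage": usage.get("VALUE") if usage is not None else None,
        })
    return records


# Sample input written for this sketch; the timestamp is hypothetical.
EXAMPLE = """
<CHANNEL HREF="http://www.mysite.com/Channel/homepage.htm">
  <ITEM HREF="foobar.htm" LASTMOD="1999-01-01T01:01" LEVEL="2">
    <USAGE VALUE="ScreenSaver"></USAGE>
  </ITEM>
</CHANNEL>
"""

items = parse_items(EXAMPLE)
print(items[0]["href"], items[0]["lastmod"], items[0]["level"])
```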

A conventional search engine accesses websites on the network. The
search engine downloads data from the website and archives selected downloaded
data. The archived data is linked to the website from which it was downloaded.

One can use the search engine to search for a particular website containing 
desirable information by entering a query into the search engine. The search 
engine will search its archived data and return websites in its archived database 
which relate to the query. 

The dynamic nature of the Internet results in websites being updated
regularly. Consequently, data which was on the website when the search engine
initially visited the website may no longer be there. Alternatively, the data may be
outdated. Further, the website may no longer exist or its URL may have changed.
As a result, data archived by the search engine could become invalid. In order for
the search engine to be a useful tool, the search engine must periodically update
its archived data.

A conventional search engine uses a web crawler (e.g., a "robot", "spider",
"ant", etc.) to visit (i.e., access) a server on a network. The spider "crawls" from a
homepage (i.e., the first or main webpage) of a website to the various subpages
linked from the homepage. As the web crawler visits the various homepages with
subpages, data on the pages are selectively archived by the search engine.

Typical crawlers visit websites at regular intervals, for example, every
30 days. If a web crawler accesses a website which has not been updated since the
last time the web crawler visited, the web crawler would presume that the data
previously archived is still valid. This presumption may be erroneous.

That is, one disadvantage with current web crawler technology is that the 
web crawler does not know when a website is scheduled to be updated. 
Depending on how often a website is updated, the web crawler's archived data 
could be very outdated by the time the web crawler returns. On the other hand, 
frequent web crawler visits to websites not frequently updated consume valuable
computer resources. 

SUMMARY OF THE INVENTION 

In view of the foregoing and other problems, an object of the present 
invention is to provide a method (and system) for determining when and how 
often a web crawler should return to a website. 

Another object is to provide a method (and system) for using the push
channel definition (e.g., a CDF) or other data available on the website to
determine how often to visit the website and what parts of the website to crawl,
based upon information such as SCHEDULE and ITEM available from the
website. For example, this method can take advantage of a website's "last
updated," SCHEDULE, and ITEM information meant for "push" technology to
automatically optimize when and how a web crawler crawls a website.

The invention, in a first aspect thereof, is a method (and system) for
searching files stored on a network. A first file is accessed on the network and
data is downloaded from the first file. An accessing time to access a second file
is set based on the data downloaded from the first file. In a further embodiment,
the data from the first file is analyzed to determine when a second file is
scheduled to be updated, and the accessing time is assigned based on when the
second file is scheduled to be updated. In an alternate, further embodiment, the
method includes selecting a second file to download based on data downloaded
from the first file.

The invention, in a second aspect thereof, is a method (and system) for 
searching through files on a network. The method includes accessing a server on
a network and downloading data from a first file. An accessing time to re-access 
the server is set based on data downloaded from the first file. In a further 
embodiment, the method includes accessing the server using the accessing time 
and downloading a second file from the server. In an alternate, further 
embodiment, the method includes selecting a second file to download based on 
data downloaded from the first file. 

The invention, in a third aspect thereof, is a system comprising a machine 
readable recording medium storing a program for searching through files 
accessible on a network. The program includes executable instructions for 
accessing a first file on the network and downloading data from the first file. An
accessing time to access a second file is set based on the data downloaded from 
the first file. In a further embodiment, the program includes accessing the server 
using the accessing time and downloading a second file from the server. In an 
alternate, further embodiment, the program includes selecting a second file to 
download based on data downloaded from the first file. 

With the present invention, a website can be "crawled" by using data 
previously collected from that website. For example, by using data in a CDF, the 
web crawler can be directed to crawl certain areas of the website at various 
intervals corresponding to when the website is scheduled to e-mail (i.e., "push") 
information to its subscribers. As a result, using the present invention, it is likely 
that a web crawler will encounter updated information on the website. 
Consequently, the present invention provides for more efficient web crawling of
a website by crawling the site when and where it is likely that the information
contained therein has been updated.

BRIEF DESCRIPTION OF THE DRAWINGS



The foregoing, and other objects, aspects, and advantages will be better 
understood from the following detailed description of a preferred embodiment of 
the invention with reference to the drawings, in which: 

Fig. 1 is a flow diagram illustrating a preferred method 100 of the 
invention; 

Fig. 2 is a schematic diagram of a system 200 for implementing a method
of the present invention; and

Fig. 3 is a diagram of a readable recording medium for storing executable
instructions.

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS OF THE INVENTION 

Referring to Fig. 1, method 100 is directed to searching through files 
stored on a network. Method 100 includes accessing a first file on a network 
(Step 110). 

Data is downloaded from the first file (Step 120). This data is then 
analyzed (Step 130). If the first file is a CDF, analysis includes identifying 
various elements such as CHANNEL, SCHEDULE, and ITEM (Step 130). Next, 
values corresponding to the aforementioned elements are extracted from the 
downloaded data (Step 130). 

An access time to access a second file is set using the SCHEDULE value
(Step 140). As such, the access time will be set to correspond to when the web
site is scheduled to be "pushed." A second file is selected to be downloaded
based on the ITEM value (Step 150). In one embodiment, the second file selected
is the same as the first file (Step 150).
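
Steps 110 through 150 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `run_method_100` and `_int_attr`, the injected `fetch` callable, and the URL `http://www.mysite.com/site.cdf` are all hypothetical, and the choice to set the access time at the LATESTTIME hour of the next interval (so that the visit follows the scheduled update) is only one possible policy.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta


def _int_attr(parent, tag, attr, default):
    # Hypothetical helper: read an integer attribute from a child element.
    node = parent.find(tag) if parent is not None else None
    return int(node.get(attr, default)) if node is not None else default


def run_method_100(url, fetch, now):
    """Sketch of Steps 110-150; `fetch` is any callable returning file text."""
    cdf_text = fetch(url)                      # Steps 110/120: access, download
    channel = ET.fromstring(cdf_text.strip())  # Step 130: analyze the data

    schedule = channel.find("SCHEDULE")
    days = _int_attr(schedule, "INTERVALTIME", "DAY", 1)
    hour = _int_attr(schedule, "LATESTTIME", "HOUR", 0)

    # Step 140: set the access time from the SCHEDULE values (assumed policy:
    # visit at the end of the next update window).
    access_time = (now + timedelta(days=days)).replace(
        hour=hour, minute=0, second=0, microsecond=0)

    # Step 150: select the second file(s) from the ITEM values; with no ITEM
    # present, the second file is the same as the first file.
    targets = [item.get("HREF") for item in channel.iter("ITEM")]
    return access_time, targets or [url]


# Sample CDF written for this sketch.
CDF = """
<CHANNEL HREF="http://www.mysite.com/Channel/homepage.htm">
  <SCHEDULE STARTDATE="1999-09-23" STOPDATE="1999-11-23">
    <INTERVALTIME DAY="1" />
    <EARLIESTTIME HOUR="2" />
    <LATESTTIME HOUR="6" />
  </SCHEDULE>
  <ITEM HREF="foobar.htm" LEVEL="2" />
</CHANNEL>
"""

when, targets = run_method_100(
    "http://www.mysite.com/site.cdf", lambda u: CDF, datetime(1999, 9, 23, 12, 0))
print(when, targets)
```

Passing the download function in as a parameter keeps the sketch free of network code; a real crawler would supply its own HTTP client there.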

Method 100 can be implemented by a web crawler. An example of such 
an implementation may occur as follows. A web crawler is programmed to visit 
various websites which may contain CDFs. The web crawler is adapted to use the 
CDF information as a site map to determine which sub-websites to visit. 

The first time the web crawler visits the website, the web crawler
downloads the CDF file and keeps the site in a database, storing the CHANNEL
and SCHEDULE information. Next, the web crawler uses the SCHEDULE
information in the CHANNEL tag to decide when to visit the website next.
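
One simple way to keep track of when each stored site should next be visited is a priority queue ordered by visit time. The sketch below uses Python's standard `heapq` module; the class name `VisitQueue`, its methods, and the second URL are hypothetical, and an in-memory heap stands in for the database mentioned above.

```python
import heapq
from datetime import datetime, timedelta


class VisitQueue:
    """Keeps sites ordered by the time of their next scheduled visit."""

    def __init__(self):
        self._heap = []

    def schedule(self, url, when):
        # Entries are (time, url) tuples, so the earliest visit sorts first.
        heapq.heappush(self._heap, (when, url))

    def due(self, now):
        """Pop and return every URL whose visit time has arrived."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready


queue = VisitQueue()
start = datetime(1999, 9, 23, 6, 0)
# A site whose SCHEDULE calls for daily updates, and one revisited monthly.
queue.schedule("http://www.mysite.com/Channel/homepage.htm", start + timedelta(days=1))
queue.schedule("http://www.example.com/", start + timedelta(days=30))
print(queue.due(start + timedelta(days=2)))
```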

In one embodiment, the next visit is normalized by the web crawler's own
parameters as to when to crawl a site. For instance, if a web crawler has its own
schedule and decides to crawl less frequently than the SCHEDULE value, it uses
its own schedule rather than the web site's SCHEDULE value.
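
Read literally, this normalization amounts to taking the longer of two intervals. The following sketch shows that reading; the function name `effective_interval` and its parameters are hypothetical, and the behavior when the crawler's policy is more frequent than the site's schedule (the site's schedule is kept) is an assumption, since the text addresses only the opposite case.

```python
from datetime import timedelta


def effective_interval(site_interval, crawler_min_interval):
    """Use the site's interval unless the crawler's own policy is less frequent."""
    return max(site_interval, crawler_min_interval)


# The site updates daily, but the crawler revisits at most weekly.
print(effective_interval(timedelta(days=1), timedelta(days=7)))
# The site updates every 30 days, which is already less often than weekly.
print(effective_interval(timedelta(days=30), timedelta(days=7)))
```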

When a web crawler visits a website on the web crawler's schedule, the
web crawler may selectively visit sub-sites (e.g., items or subchannels) by using
the LASTMOD and ITEM tag information in the CDF file to crawl
only those subchannels that have been, or are scheduled to be, updated. It also
uses the LEVEL attribute in any subchannel to determine how deep to crawl.
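
The selection of subchannels can be illustrated as a filter over already-parsed ITEM records. In this sketch the record layout (dictionaries with `href`, `lastmod`, and `level` keys), the function name `select_subchannels`, and the sample file names are hypothetical. Treating a subchannel with no LASTMOD value as worth crawling is likewise an assumption rather than something the description specifies.

```python
from datetime import datetime


def select_subchannels(items, last_visit):
    """Return (href, depth) pairs for subchannels modified since the last visit."""
    selected = []
    for item in items:
        lastmod = item.get("lastmod")
        # With no LASTMOD value the page's freshness is unknown, so crawl it.
        if lastmod is None or lastmod > last_visit:
            # LEVEL bounds how deep the crawler follows links from this page.
            selected.append((item["href"], item.get("level", 0)))
    return selected


items = [
    {"href": "news.htm", "lastmod": datetime(1999, 10, 2), "level": 2},
    {"href": "about.htm", "lastmod": datetime(1999, 1, 1), "level": 1},
    {"href": "misc.htm", "lastmod": None, "level": 0},
]
print(select_subchannels(items, last_visit=datetime(1999, 9, 23)))
```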

An advantage of the present method is that using the SCHEDULE and
ITEM values provides for access only when a website and the website's
associated files are scheduled to be updated. Consequently, a web crawler
utilizing this method will access a website when the website is likely to have
been updated, based on the CDF data.

Further, the method does not require any work by the website builder (e.g., 
web master) to accommodate the web crawler. The web crawler automatically 
uses the "push" information already available. 

Referring now to Fig. 2, system 200 illustrates a typical hardware
configuration for carrying out method 100. Preferably, system 200 has at least one
processor or central processing unit (CPU) 211. The CPUs 211 are
interconnected via a system bus 212 to a random access memory (RAM) 214,
read-only memory (ROM) 216, input/output (I/O) adapter 218 (for connecting
peripheral devices such as disk units 221 and tape drives 240 to the bus 212), user
interface adapter 222 (for connecting a keyboard 224, mouse 226, speaker 228,
microphone 232, and/or other user interface device to the bus 212), a
communication adapter 234 for connecting an information handling system to a
data processing network, the Internet, an Intranet, a personal area network (PAN),
or other similar information systems, and a display adapter 236 for connecting the
bus 212 to a display device 238. Further, an automated reader/scanner 240 may be
included. Such readers/scanners are commercially available from many sources.

In addition to the hardware/software environment described above, a
different aspect of the invention includes a computer-implemented method for
performing the above method. As an example, this method may be implemented
in the particular environment discussed above.

Such a method may be implemented, for example, by operating the CPU
211 (Fig. 2) to execute a sequence of machine-readable instructions. These
instructions may reside in various types of signal-bearing media.

Thus, this aspect of the present invention is directed to a programmed
product, comprising signal-bearing media tangibly embodying a program of
machine-readable instructions executable by a digital data processor incorporating
the CPU 211 and hardware above, to perform the method of the invention.

The signal-bearing media may include, for example, a RAM contained
within the CPU 211, as represented by the fast-access storage for example.
Alternatively, the instructions may be contained in another signal-bearing medium,
such as a magnetic data storage diskette 300 (Fig. 3), directly or indirectly
accessible by the CPU 211.

Whether contained in the diskette 300, the computer/CPU 211, or
elsewhere, the instructions may be stored on a variety of machine-readable data
storage media, such as DASD storage (e.g., a conventional "hard drive" or a RAID
array), magnetic tape, electronic read-only memory (e.g., ROM, EPROM, or
EEPROM), an optical storage device (e.g., CD-ROM, WORM, DVD, digital
optical tape, etc.), paper "punch" cards, or other suitable signal-bearing media,
including transmission media such as digital and analog communication links
and wireless links. In an illustrative embodiment of the invention, the
machine-readable instructions may comprise software object code, compiled from
a language such as "C", etc.

There are several advantages to the present invention. A major advantage 
is the invention's ability to "screen" websites. As the number of pages on the web 
grows (conceivably to well beyond 1 billion), it is impossible for search engines to 
keep up to date with all of these pages. The present invention provides a method 
and system which allows search engines to visit the pages that are the most 
recently updated, and to not visit those web pages that have not been updated. 

Another advantage of the present invention is that it is not limited to CDF 
files only. It can work with any sitemap structure that a website provides with 
"change dates." For instance, Netscape uses a different format based on Resource 
Description Framework (RDF). 

While the invention has been described in terms of preferred 
embodiments, those skilled in the art will recognize that the invention can be 
practiced with modification within the spirit and scope of the appended claims. 

CLAIMS



What is claimed is: 

1. A method for searching files stored on a network, comprising:
accessing a first file on the network;
downloading data from the first file; and
setting an accessing time to access a second file based on said data
downloaded from the first file.

2. The method of claim 1, wherein the second file is the same as the first file.

3. The method of claim 1, further comprising selecting a second file to download
based on said data downloaded from the first file.

4. The method of claim 1, wherein the first file comprises a channel definition
file (CDF).

5. The method of claim 1, wherein said setting an accessing time comprises:
analyzing the data from the first file to determine when a second file is
scheduled to be updated; and
assigning the accessing time based on when the second file is scheduled to
be updated.

6. The method of claim 3, wherein said setting an accessing time comprises:
analyzing the data from the first file to determine when a second file is
scheduled to be updated; and
assigning the accessing time based on when the second file is scheduled to
be updated.

7. A method for searching files on a network, comprising:
accessing a server on the network;
downloading data from a first file; and
setting an accessing time to re-access the server based on said data
downloaded from the first file.

8. The method of claim 7, further comprising:
accessing the server using the accessing time; and
downloading a second file from the server.

9. The method of claim 8, wherein the second file is the same as the first file.

10. The method of claim 7, further comprising selecting a second file to
download based on said data downloaded from the first file.

11. The method of claim 8, further comprising selecting a second file to
download based on said data downloaded from the first file.

12. The method of claim 7, wherein the first file comprises a channel definition
file (CDF).

13. The method of claim 7, wherein said setting an accessing time comprises:
analyzing the data from the first file to determine when a second file is
scheduled to be updated; and
assigning the accessing time based on when the second file is scheduled to
be updated.

14. The method of claim 13, wherein the accessing time is after the scheduled
update of the second file.

15. The method of claim 8, wherein said setting an accessing time comprises:
analyzing the data from the first file to determine when a second file is
scheduled to be updated; and
assigning the accessing time based on when the second file is scheduled to
be updated.



16. The method of claim 10, wherein setting an accessing time comprises:
analyzing the data from the first file to determine when a second file is
scheduled to be updated; and
assigning the accessing time based on when the second file is scheduled to
be updated.

17. A system comprising a machine readable recording medium storing a
program for searching through files stored on a network, said program including
executable instructions for:
accessing a first file on the network;
downloading data from the first file; and
setting an accessing time to access a second file based on said data
downloaded from the first file.

18. The system of claim 17, wherein the second file is the same as the first file.

19. The system of claim 17, further comprising selecting a second file to access
based on said data downloaded from the first file.

20. The system of claim 17, wherein the first file comprises a channel definition
file (CDF).

21. The system of claim 17, wherein setting an accessing time comprises:
analyzing the data from the first file to determine when a second file is
scheduled to be updated; and
assigning the accessing time based on when the second file is scheduled to
be updated.

22. The system of claim 19, wherein setting an accessing time comprises:
analyzing the data from the first file to determine when a second file is
scheduled to be updated; and
assigning the accessing time based on when the second file is scheduled to
be updated.



26. A system for searching files stored on a network, comprising: 
means for accessing a first file on the network; 
means for downloading data from the first file; and 
means for setting an accessing time to access a second file based on said 
data downloaded from the first file. 



ABSTRACT 



A method (and system) for periodically searching through files accessible 
through a network, at an interval based on previously accessed data. The method 
includes accessing and downloading data from a first file on the network. An
accessing time is set to access a second file on the network based on the data 
downloaded from the first file. In one further embodiment, the first file is a 
Channel Definition Format (CDF) file. 